Univariate Plots Section

##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"
## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

The dataset has 1599 observations, each with 12 variables.

Plot histograms of all variables.

Plot the histogram of quality.

## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18

Most wines (82%) have a quality score of 5 or 6.
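The 82% figure can be checked from the counts in the table above. This is a minimal sketch in base R; the names `quality_counts` and `share_5_6` are made up for illustration, and the counts are copied from the printed output rather than recomputed from the data.

```r
# Counts per quality score, copied from the table() output above.
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)

# Share of each score among all wines.
quality_share <- quality_counts / sum(quality_counts)

# Combined share of scores 5 and 6, as a percentage.
share_5_6 <- sum(quality_share[c("5", "6")])
round(100 * share_5_6, 1)
```

The counts sum to 1599, which matches the number of observations reported by `str()`.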

Plot the histogram of fixed.acidity.

## 
##  7.2  7.1  7.8  7.5    7  7.7  6.8  7.6  8.2  7.3  7.4  7.9    8  8.3  6.9 
##   67   57   53   52   50   49   46   46   45   44   44   42   42   40   38 
##  6.6  8.8  8.9  9.1  6.7  8.6  8.1  8.4    9  9.9  6.4  8.7   10  9.3 10.4 
##   37   34   33   29   28   27   26   26   26   26   25   24   23   22   21 
##  6.2  8.5 10.2  6.5  9.4  9.6  6.1  9.2  9.8  5.6  6.3  9.5 10.6    6 11.5 
##   20   19   19   17   17   17   16   16   15   14   14   14   14   13   13 
## 10.5 11.6 11.9 10.3 10.1 10.7 10.8  5.9  9.7 11.1 10.9 11.3   12 12.5    5 
##   12   12   12   11   10   10   10    9    9    9    8    7    7    7    6 
##  5.2  5.4 11.2 11.4 12.3 12.8  5.1  5.3  5.8 12.2 12.4 12.6 12.7   11 11.7 
##    6    5    5    5    5    5    4    4    4    4    4    4    4    3    3 
## 11.8   13 13.2 13.3  5.7 12.9 13.7   15 15.5 15.6  4.6  4.7  4.9  5.5 12.1 
##    3    3    3    3    2    2    2    2    2    2    1    1    1    1    1 
## 13.4 13.5 13.8   14 14.3 15.9 
##    1    1    1    1    1    1

fixed.acidity peaks at 7.2, with a few outliers near 16.

Plot the histogram of volatile.acidity.

## 
##   0.6   0.5  0.43  0.59  0.36  0.58   0.4  0.38  0.39  0.49  0.56  0.41 
##    47    46    43    39    38    38    37    35    35    35    34    33 
##  0.52  0.42  0.46  0.54  0.31  0.34  0.53  0.63  0.57  0.61  0.64  0.66 
##    33    31    31    31    30    30    29    29    28    27    27    26 
##  0.37  0.48  0.51  0.62  0.28  0.32  0.44  0.67  0.69  0.35  0.45  0.47 
##    24    24    24    24    23    23    23    23    23    22    22    21 
##  0.33  0.55  0.26  0.29   0.3  0.65  0.27  0.24 0.645  0.68 0.715 0.685 
##    20    20    16    16    16    16    14    13    12    12    12    11 
##  0.74  0.18   0.7  0.78 0.635 0.725 0.735 0.785  0.84  0.25 0.655 0.695 
##    11    10    10    10     9     9     8     8     8     7     7     7 
##  0.21  0.22 0.615 0.705  0.73  0.75  0.77  0.23 0.545  0.72 0.745  0.76 
##     6     6     6     6     6     6     6     5     5     5     5     5 
## 0.765  0.82  0.88 0.885 0.775  0.83 0.835  0.87 0.915  1.02  0.12   0.2 
##     5     5     5     5     4     4     4     4     4     4     3     3 
## 0.415 0.575 0.585 0.605 0.625 0.665 0.675  0.71 0.755   0.8 0.815 0.855 
##     3     3     3     3     3     3     3     3     3     3     3     3 
##   0.9  0.91  0.96 0.965  0.98     1  1.04  0.16  0.19 0.305 0.315 0.365 
##     3     3     3     3     3     3     3     2     2     2     2     2 
## 0.395 0.475  0.79 0.795  0.81  0.85  0.86 0.875 0.935  1.33 0.295 0.565 
##     2     2     2     2     2     2     2     2     2     2     1     1 
## 0.595 0.805 0.825 0.845 0.865  0.89 0.895  0.92  0.95 0.955 0.975 1.005 
##     1     1     1     1     1     1     1     1     1     1     1     1 
##  1.01 1.025 1.035  1.07  1.09 1.115  1.13  1.18 1.185  1.24  1.58 
##     1     1     1     1     1     1     1     1     1     1     1

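volatile.acidity peaks at 0.6, with an outlier near 1.6.

Remove 1% of the values as outliers and plot the histogram again.

The result is an approximately symmetric, bimodal histogram.

Plot the histogram of citric.acid.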

## 
##    0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09  0.1 0.11 0.12 0.13 0.14 
##  132   33   50   30   29   20   24   22   33   30   35   15   27   18   21 
## 0.15 0.16 0.17 0.18 0.19  0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29 
##   19    9   16   22   21   25   33   27   25   51   27   38   20   19   21 
##  0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39  0.4 0.41 0.42 0.43 0.44 
##   30   30   32   25   24   13   20   19   14   28   29   16   29   15   23 
## 0.45 0.46 0.47 0.48 0.49  0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59 
##   22   19   18   23   68   20   13   17   14   13   12    8    9    9    8 
##  0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69  0.7 0.71 0.72 0.73 0.74 
##    9    2    1   10    9    7   14    2   11    4    2    1    1    3    4 
## 0.75 0.76 0.78 0.79    1 
##    1    3    1    1    1

There are 132 zero values and one outlier at 1. The distribution is multimodal.

Plot the histogram of residual.sugar.

## 
##    2  2.2  1.8  2.1  1.9  2.3  2.4  2.5  2.6  1.7  1.6  2.8  2.7  1.4  1.5 
##  156  131  129  128  117  109   86   84   79   76   58   49   39   35   30 
##    3  2.9  3.2  3.4  3.3    4  1.2  3.6  3.8  4.3  5.5  3.1  3.9  4.1  4.6 
##   25   24   15   15   11   11    8    8    8    8    8    7    6    6    6 
##  5.6  1.3  4.2  5.1  3.7  4.4  4.5  5.8    6  6.1  4.8  5.2  5.9  6.2  6.4 
##    6    5    5    5    4    4    4    4    4    4    3    3    3    3    3 
##  7.9  8.3  0.9 1.65 1.75 2.05 2.15  3.5 4.65  6.3 6.55  6.6  6.7  7.8  8.1 
##    3    3    2    2    2    2    2    2    2    2    2    2    2    2    2 
##  8.8   11 13.8 15.4 2.25 2.35 2.55 2.65 2.85 2.95 3.45 3.65 3.75 4.25  4.7 
##    2    2    2    2    1    1    1    1    1    1    1    1    1    1    1 
##    5 5.15  5.4  5.7    7  7.2  7.3  7.5  8.6  8.9    9 10.7 12.9 13.4 13.9 
##    1    1    1    1    1    1    1    1    1    1    1    1    1    1    1 
## 15.5 
##    1

The peak is at 2, with a very long right tail.

Apply a log transform to residual.sugar and plot the histogram again.

Plot the histogram of chlorides.

## 
##  0.08 0.074 0.076 0.078 0.084 0.071 0.077 0.082 0.075 0.079 0.081  0.07 
##    66    55    51    51    49    47    47    46    45    43    40    35 
## 0.073 0.083 0.066 0.088 0.086 0.068 0.067 0.085 0.087 0.089 0.062 0.072 
##    35    35    32    32    31    30    27    25    25    25    24    24 
## 0.065 0.095 0.063 0.092 0.069  0.09 0.093 0.064 0.091 0.094 0.096 0.097 
##    23    23    22    22    21    21    21    20    19    19    18    18 
## 0.059  0.06 0.104 0.058 0.054   0.1  0.05 0.098 0.061 0.114 0.052 0.057 
##    17    16    16    14    13    13    12    12    11    11    10    10 
## 0.102 0.056 0.107 0.048 0.049 0.055 0.099 0.106  0.11 0.118 0.103 0.111 
##    10     9     9     8     8     8     8     8     8     8     7     7 
## 0.122 0.105 0.112 0.123 0.044 0.053 0.101 0.115 0.039 0.041 0.045 0.046 
##     7     6     6     6     5     5     5     5     4     4     4     4 
## 0.047 0.117 0.132 0.042 0.109 0.119  0.12 0.124 0.157 0.166 0.214 0.415 
##     4     4     4     3     3     3     3     3     3     3     3     3 
## 0.012 0.038 0.116 0.121 0.152 0.171 0.178 0.205 0.226 0.414 0.034 0.043 
##     2     2     2     2     2     2     2     2     2     2     1     1 
## 0.051 0.108 0.113 0.125 0.126 0.127 0.128 0.136 0.137 0.143 0.145 0.146 
##     1     1     1     1     1     1     1     1     1     1     1     1 
## 0.147 0.148 0.153 0.159 0.161 0.165 0.168 0.169  0.17 0.172 0.174 0.176 
##     1     1     1     1     1     1     1     1     1     1     1     1 
## 0.186  0.19 0.194   0.2 0.213 0.216 0.222  0.23 0.235 0.236 0.241 0.243 
##     1     1     1     1     1     1     1     1     1     1     1     1 
##  0.25 0.263 0.267  0.27 0.332 0.337 0.341 0.343 0.358  0.36 0.368 0.369 
##     1     1     1     1     1     1     1     1     1     1     1     1 
## 0.387 0.401 0.403 0.413 0.422 0.464 0.467  0.61 0.611 
##     1     1     1     1     1     1     1     1     1

The peak is at 0.08, with a very long right tail.

Apply a log transform to chlorides and plot the histogram again.

Plot the histogram of free.sulfur.dioxide.

## 
##   6   5  10  15  12   7 
## 138 104  79  78  75  71

free.sulfur.dioxide peaks at 6, with a long tail and some outliers.

Plot the histogram of total.sulfur.dioxide.

## 
## 28 24 15 18 23 14 
## 43 36 35 35 34 33

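total.sulfur.dioxide peaks at 28, with a long tail and some outliers. Its distribution resembles that of free.sulfur.dioxide, so I suspect the two variables are correlated.

Plot the histogram of density.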

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0040

Approximately normal, with median 0.9968 and mean 0.9967.

Plot the histogram of pH.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

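Approximately normal, with median 3.310 and mean 3.311.

Plot the histogram of sulphates.

The distribution has a long tail and some outliers. A log transform makes it approximately normal, with the peak near 0.6.

Plot the histogram of alcohol.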

## 
##  9.5  9.4  9.8  9.2   10 10.5 
##  139  103   78   72   67   67

The peak is at 9.5. The shape of this histogram resembles those of total.sulfur.dioxide and free.sulfur.dioxide.

Univariate Analysis

What is the structure of your dataset?

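The dataset has 1599 observations, each with 12 variables. quality is the rating variable, treated as a factor; the rating scale goes up to 10, but the observed scores range only from 3 to 8.

  1. citric.acid contains a large number of zero values.
  2. density and pH are approximately normally distributed.
  3. residual.sugar, chlorides and sulphates have long right tails.
  4. Most wines (82%) have a quality score of 5 or 6.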

What is/are the main feature(s) of interest in your dataset?

The main feature of interest is quality. I want to find out which factors influence it.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

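I am interested in volatile acidity and citric acid. I have read that a small amount of citric acid can improve the taste of red wine, while high volatile acidity degrades it. I also suspect that residual sugar, free/total sulfur dioxide and alcohol affect quality. These guesses are tested in the analysis below.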

Did you create any new variables from existing variables in the dataset?

I created a factor variable from quality and used it to group the wines by quality.
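One way to build such a factor is sketched below on the first ten quality values printed by `str()` above. The column name `quality.factor` and the choice of an ordered factor are assumptions; the report does not show the name or options it used.

```r
# First ten quality scores, copied from the str() output above.
wine <- data.frame(quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L))

# Ordered factor covering every score observed in the full dataset (3 to 8),
# so the levels stay the same even when a subset lacks some scores.
wine$quality.factor <- factor(wine$quality, levels = 3:8, ordered = TRUE)

# Counts per level; scores absent from this subset show up as 0.
table(wine$quality.factor)
```

Fixing the levels explicitly keeps boxplots and grouped summaries aligned across subsets of the data.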

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

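residual sugar, chlorides and sulphates are right-skewed with long tails, so I applied a log transform to them. The log transform compresses large values, which pulls the outliers closer to the median. After the transform these variables look closer to a normal distribution.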
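The effect of the log transform on the tail can be illustrated with the residual.sugar summary values printed earlier. This is a rough sketch, not the report's own code: the variable names are hypothetical, and base 10 is an arbitrary choice (the ratio computed at the end is the same for any log base).

```r
# Summary values for residual.sugar, copied from the summary() output above.
rs_min <- 0.9
rs_median <- 2.2
rs_max <- 15.5

# Distance of each extreme from the median on the raw scale.
raw_upper <- rs_max - rs_median   # maximum minus median
raw_lower <- rs_median - rs_min   # median minus minimum

# The same distances after a log10 transform.
log_upper <- log10(rs_max) - log10(rs_median)
log_lower <- log10(rs_median) - log10(rs_min)

# Ratio of upper to lower distance: values far above 1 mean a long right tail.
c(raw = raw_upper / raw_lower, log10 = log_upper / log_lower)
```

On the raw scale the maximum is about ten times farther from the median than the minimum is; after the transform it is only about twice as far, which is why the transformed histogram looks more symmetric.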

Bivariate Plots Section

##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity           1.00000000     -0.256130895  0.67170343
## volatile.acidity       -0.25613089      1.000000000 -0.55249568
## citric.acid             0.67170343     -0.552495685  1.00000000
## residual.sugar          0.11477672      0.001917882  0.14357716
## chlorides               0.09370519      0.061297772  0.20382291
## free.sulfur.dioxide    -0.15379419     -0.010503827 -0.06097813
## total.sulfur.dioxide   -0.11318144      0.076470005  0.03553302
## density                 0.66804729      0.022026232  0.36494718
## pH                     -0.68297819      0.234937294 -0.54190414
## sulphates               0.18300566     -0.260986685  0.31277004
## alcohol                -0.06166827     -0.202288027  0.10990325
## quality                 0.12405165     -0.390557780  0.22637251
##                      residual.sugar    chlorides free.sulfur.dioxide
## fixed.acidity           0.114776724  0.093705186        -0.153794193
## volatile.acidity        0.001917882  0.061297772        -0.010503827
## citric.acid             0.143577162  0.203822914        -0.060978129
## residual.sugar          1.000000000  0.055609535         0.187048995
## chlorides               0.055609535  1.000000000         0.005562147
## free.sulfur.dioxide     0.187048995  0.005562147         1.000000000
## total.sulfur.dioxide    0.203027882  0.047400468         0.667666450
## density                 0.355283371  0.200632327        -0.021945831
## pH                     -0.085652422 -0.265026131         0.070377499
## sulphates               0.005527121  0.371260481         0.051657572
## alcohol                 0.042075437 -0.221140545        -0.069408354
## quality                 0.013731637 -0.128906560        -0.050656057
##                      total.sulfur.dioxide     density          pH
## fixed.acidity                 -0.11318144  0.66804729 -0.68297819
## volatile.acidity               0.07647000  0.02202623  0.23493729
## citric.acid                    0.03553302  0.36494718 -0.54190414
## residual.sugar                 0.20302788  0.35528337 -0.08565242
## chlorides                      0.04740047  0.20063233 -0.26502613
## free.sulfur.dioxide            0.66766645 -0.02194583  0.07037750
## total.sulfur.dioxide           1.00000000  0.07126948 -0.06649456
## density                        0.07126948  1.00000000 -0.34169933
## pH                            -0.06649456 -0.34169933  1.00000000
## sulphates                      0.04294684  0.14850641 -0.19664760
## alcohol                       -0.20565394 -0.49617977  0.20563251
## quality                       -0.18510029 -0.17491923 -0.05773139
##                         sulphates     alcohol     quality
## fixed.acidity         0.183005664 -0.06166827  0.12405165
## volatile.acidity     -0.260986685 -0.20228803 -0.39055778
## citric.acid           0.312770044  0.10990325  0.22637251
## residual.sugar        0.005527121  0.04207544  0.01373164
## chlorides             0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide   0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide  0.042946836 -0.20565394 -0.18510029
## density               0.148506412 -0.49617977 -0.17491923
## pH                   -0.196647602  0.20563251 -0.05773139
## sulphates             1.000000000  0.09359475  0.25139708
## alcohol               0.093594750  1.00000000  0.47616632
## quality               0.251397079  0.47616632  1.00000000

quality is most strongly correlated with alcohol (0.48), volatile acidity (-0.39), sulphates (0.25) and citric acid (0.23).
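The ranking can be reproduced from the quality column of the matrix above. A minimal sketch, with the coefficients copied from the printed output; `quality_cor` and `ranked` are illustrative names.

```r
# Correlations with quality, copied from the quality column of the matrix above.
quality_cor <- c(
  fixed.acidity        =  0.12405165,
  volatile.acidity     = -0.39055778,
  citric.acid          =  0.22637251,
  residual.sugar       =  0.01373164,
  chlorides            = -0.12890656,
  free.sulfur.dioxide  = -0.05065606,
  total.sulfur.dioxide = -0.18510029,
  density              = -0.17491923,
  pH                   = -0.05773139,
  sulphates            =  0.25139708,
  alcohol              =  0.47616632
)

# Rank by absolute value so strong negative correlations are not overlooked.
ranked <- quality_cor[order(abs(quality_cor), decreasing = TRUE)]
round(head(ranked, 4), 2)
```

Sorting on the absolute value matters here: volatile acidity would land last in a plain descending sort even though it is the second strongest relationship.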

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.400   9.725   9.925   9.955  10.580  11.000 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00    9.60   10.00   10.27   11.00   13.10 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.5     9.4     9.7     9.9    10.2    14.9 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.80   10.50   10.63   11.30   14.00 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.50   11.47   12.10   14.00 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.80   11.32   12.15   12.09   12.88   14.00

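Higher-quality wines tend to have higher alcohol. The median alcohol rises with quality at every level except quality 5, and the quality-5 group contains many outliers. I think this may be due to errors in the sample.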
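The layout of the output above suggests a call to `by()` with `summary`. The sketch below assumes that and runs it on the first ten rows printed by `str()`, so its numbers differ from the full-data output; the `tapply()` line is an addition that extracts only the medians.

```r
# First ten rows of alcohol and quality, copied from the str() output above.
wine <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Per-group summaries, printed in the same layout as the output above.
by(wine$alcohol, wine$quality, summary)

# Medians only, which makes the trend across quality levels easier to read.
alcohol_medians <- tapply(wine$alcohol, wine$quality, median)
alcohol_medians
```

Comparing medians rather than means keeps the trend from being distorted by the outliers in the quality-5 group.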

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4400  0.6475  0.8450  0.8845  1.0100  1.5800 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.230   0.530   0.670   0.694   0.870   1.130 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.180   0.460   0.580   0.577   0.670   1.330 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1600  0.3800  0.4900  0.4975  0.6000  1.0400 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3000  0.3700  0.4039  0.4850  0.9150 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2600  0.3350  0.3700  0.4233  0.4725  0.8500

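volatile.acidity is negatively correlated with quality. As quality increases, the median volatile.acidity falls, although it levels off between quality 7 and 8 (both medians are 0.37). Overall, good wines have lower volatile.acidity.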

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4000  0.5125  0.5450  0.5700  0.6150  0.8600 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.4900  0.5600  0.5964  0.6000  2.0000 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.370   0.530   0.580   0.621   0.660   1.980 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4000  0.5800  0.6400  0.6753  0.7500  1.9500 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3900  0.6500  0.7400  0.7413  0.8300  1.3600 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.6300  0.6900  0.7400  0.7678  0.8200  1.1000

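sulphates increases with quality. However, the quality 5 and 6 groups contain many outliers, possibly because of errors in the sample, so I would not rely on this correlation alone. The most we can say is that sulphates may affect the taste of the wine.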

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0050  0.0350  0.1710  0.3275  0.6600 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0300  0.0900  0.1742  0.2700  1.0000 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0900  0.2300  0.2437  0.3600  0.7900 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0900  0.2600  0.2738  0.4300  0.7800 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.3050  0.4000  0.3752  0.4900  0.7600 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0300  0.3025  0.4200  0.3911  0.5300  0.7200

citric.acid increases with quality, so the two are positively correlated. One interesting pattern: the medians are close within each pair of quality levels, 3 and 4, 5 and 6, and 7 and 8.

## <ScaleContinuousPosition>
##  Range:  
##  Limits: 0.994 --    1

## <ScaleContinuousPosition>
##  Range:  
##  Limits: 3.06 -- 3.57

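density, pH and fixed.acidity are also related to quality, though only weakly (correlations of -0.17, -0.06 and 0.12): higher-quality wines tend to have higher fixed.acidity and lower density and pH.

The correlation matrix also shows correlations among the variables other than quality:

  1. Fixed acidity vs citric acid (0.67)
  2. Volatile acidity vs citric acid (-0.55)
  3. Fixed acidity vs density (0.67)
  4. Fixed acidity vs pH (-0.68)
  5. Citric acid vs pH (-0.54)
  6. Free sulfur dioxide vs total sulfur dioxide (0.67)

The scatter plots show a strong positive correlation between fixed acidity and citric acid (as one increases, so does the other), a negative correlation between volatile acidity and citric acid (as one increases, the other decreases), and a strong positive correlation between density and fixed.acidity.

pH is negatively correlated with both fixed acidity and citric acid, which matches what we expect of acids: more acid means a lower pH.

total sulfur dioxide and free sulfur dioxide are positively correlated because total sulfur dioxide includes free sulfur dioxide, so when one increases the other does too.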

## 
##  Pearson's product-moment correlation
## 
## data:  wine$chlorides and wine$sulphates
## t = 15.978, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3282127 0.4127694
## sample estimates:
##       cor 
## 0.3712605

## 
##  Pearson's product-moment correlation
## 
## data:  chlorides and sulphates
## t = -2.0202, df = 1466, p-value = 0.04354
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.103573750 -0.001532528
## sample estimates:
##         cor 
## -0.05269068

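chlorides and sulphates are not really correlated. Their correlation coefficient is 0.37 on the full data, but after removing 5% of the values as outliers it drops to -0.05.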
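The report does not show how the outliers were removed. The sketch below assumes a cut at the 95th percentile of each variable and uses small synthetic data (not the wine data) to show how a single extreme point can create a large Pearson correlation that disappears after trimming. All names in it are hypothetical.

```r
# Synthetic data (NOT from the wine dataset): eight points with almost no
# linear relationship, plus one extreme point in the upper right corner.
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 6, 7, 8, 50),
  y = c(2, 1, 2, 1, 2, 1, 2, 1, 10)
)

# The correlation on all points is dominated by the single extreme point.
cor_all <- cor(toy$x, toy$y)

# Keep only rows at or below the 95th percentile of both variables
# (assumed trimming rule; the report's exact rule is not shown).
trimmed <- subset(toy, x <= quantile(x, 0.95) & y <= quantile(y, 0.95))
cor_trimmed <- cor(trimmed$x, trimmed$y)

c(all = cor_all, trimmed = cor_trimmed)
```

On the real data the same comparison would use `cor.test()` on the full and trimmed data frames, as in the two outputs above.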

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

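  1. The correlation coefficients between quality and the other variables are: alcohol (0.48), volatile acidity (-0.39), sulphates (0.25), citric acid (0.23).
  2. Higher-quality wines have higher alcohol content.
  3. Higher-quality wines have lower volatile acidity.
  4. Quality and sulphates appear to be positively correlated, but many outliers show up at quality 5 and 6.
  5. Low-quality wines (3, 4) contain very little citric acid; medium-quality wines (5, 6) contain about 0.25 g/dm^3; high-quality wines (7, 8) contain more than 0.25 g/dm^3.
  6. Higher-quality wines have lower density and pH.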

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

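Wines with high fixed acidity also have high citric acid, and higher citric acid goes with higher quality. volatile acidity and fixed acidity are negatively correlated, and high volatile acidity is associated with lower quality.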

What was the strongest relationship you found?

Among the relationships with quality, the strongest is with alcohol (0.48). The boxplot shows that the higher the alcohol, the higher the quality.

Multivariate Plots Section

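The boxplot above shows the relationship between citric acid and volatile acidity at each quality level. Within every quality level, citric acid and volatile acidity are negatively correlated. Two points follow:

  1. Higher-quality wines have lower volatile acidity.
  2. Within each quality level, citric acid and volatile acidity are negatively correlated.

For high-quality wines, the ratio of citric.acid to fixed.acidity is close to 0.05.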
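The ratio can be computed as a derived column. The sketch below uses the first ten rows printed by `str()`; the column name `citric.ratio` is hypothetical, and ten rows are far too few to confirm the 0.05 figure, so this only shows the mechanics.

```r
# First ten rows, copied from the str() output above.
wine <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  citric.acid   = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  quality       = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

# Hypothetical derived column: citric acid as a fraction of fixed acidity.
wine$citric.ratio <- wine$citric.acid / wine$fixed.acidity

# Median ratio per quality level.
tapply(wine$citric.ratio, wine$quality, median)
```

The fourth row (0.56 over 11.2) happens to give a ratio of exactly 0.05.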

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

Splitting by quality level makes the relationship between citric acid and volatile acidity clearer: within every quality level the two are negatively correlated. A linear model on citric acid and volatile acidity can be used to predict quality.
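The report does not show its model code. A minimal sketch of such a linear model with `lm()`, fitted on the first ten rows printed by `str()` and treating quality as numeric (an assumption), might look like this. With so few rows the coefficients are not meaningful; the point is only the formula.

```r
# First ten rows, copied from the str() output above.
wine <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  citric.acid      = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Ordinary least squares: quality explained by the two acidity measures.
fit <- lm(quality ~ citric.acid + volatile.acidity, data = wine)

# Intercept and one slope per predictor.
coef(fit)
```

On the full data, `summary(fit)` would report the R-squared, which is the natural place to start when judging how much of quality these two variables explain.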

Were there any interesting or surprising interactions between features?

The ratio of citric acid to fixed acidity is a good indicator of quality. For high-quality wines this ratio is close to 0.05.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.


Final Plots and Summary

Plot One

Description One

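citric acid has a multimodal distribution with three peaks, near 0, 0.25 and 0.5 (the most frequent values in the table are 0, 0.24 and 0.49). The sample contains a large number of zero values.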

Plot Two

Description Two

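Higher-quality wines have more citric acid and lower volatile acidity, and citric acid and volatile acidity are negatively correlated. One possible explanation, which this analysis does not test, is that citric acid and volatile acidity convert into one another under certain conditions.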

Plot Three

Description Three

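The plot shows that when volatile acidity is above 1, no wine in the sample is rated excellent. When volatile acidity is between 0 and 0.3, about 40% of the wines are rated excellent. When it is between 1 and 1.2, about 80% of the wines are rated bad, and above 1.4 every wine is rated bad (in this sample that is a single wine, at 1.58). Volatile acidity is therefore a useful feature for flagging bad wines.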


Reflection

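This dataset contains 1599 observations of 12 variables. To understand the distribution of each variable, I first examined its histogram. Some variables showed interesting distributions, such as the multimodal distribution of citric acid. Next, using the correlation matrix, I explored the relationship between every variable and quality. A few variables are more strongly correlated with quality than the rest: alcohol, volatile acidity, sulphates and citric acid. I drew box plots of these variables against quality. Because citric acid is multimodal, it caught my interest, so I then focused on the relationships among citric acid, fixed acidity and volatile acidity. The conclusions concern the ratio of citric acid to fixed acidity, the negative correlation between citric acid and volatile acidity, and the finding that volatile acidity can identify bad quality.

An obvious limitation of this dataset is that quality depends on the tasters' subjective preferences. For example, some experts rely on particular criteria (such as dryness or sweetness) to judge the quality of a wine. Ordinary drinkers generally do not know these criteria, and I would like to know how far their ratings would differ from the experts' if they applied them. I suggest that drawing tasters from different groups of people would make the data more convincing.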